Abstract. The Resource Description Framework (RDF) provides
a common data model for the integration of "real-time"
social and sensor data streams with the Web and with each 
other. While there exist numerous protocols and data formats 
for exchanging dynamic RDF data, or RDF updates, these 
options should be examined carefully in order to enable a Se- 
mantic Web equivalent of the high-throughput, low-latency 
streams of typical Web 2.0, multimedia, and gaming appli- 
cations. This paper contains a brief survey of RDF update 
formats and a high-level discussion of both TCP and UDP- 
based transport protocols for updates. Its main contribution 
is the experimental evaluation of a UDP-based architecture 
which serves as a real-world example of a high-performance 
RDF streaming application in an Internet-scale distributed 
environment. 
1. Introduction

Streaming data is an increasingly important com- 
ponent of the World Wide Web environment. Social 
networking APIs such as Twitter and Facebook pro- 
vide continuous, high-volume feeds of user content 
and activities, supporting an entire ecosystem of "real 
time" applications. Mobile devices serve as personal 
gateways for a wide variety of near-real-time sen- 
sor data. There are good reasons to integrate real- 
time data sources both with static Web data and with 
each other, and Semantic Web technologies provide a 
potential platform for that integration. For example,
mapping real-time social data into common Semantic
Web vocabularies [12] [7] enables "smarter" real-time
queries which draw upon the wealth of general-purpose
knowledge contained in the Linking Open Data cloud. 1
Bridging the gap between sensor data and
the symbolic space of the Semantic Web [4] opens the 
door to a semantic Internet of Things, while the com- 
bination of social network data with sensor data [3] 
promises more personalized and contextually aware 
real-time services. 

The Resource Description Framework (RDF) pro- 
vides a common data model in which to express and 
combine schema-friendly information from diverse 
sources. Furthermore, various notions of RDF updates 
or changesets permit the communication of dynamic 
changes to that data, such as the posting of a photo 
or the change of a user's geolocation. Emerging tech- 
nologies such as SPARQL 1.1 2 and sparqlPuSH [10] 
provide transport mechanisms for updates. As the Se- 
mantic Web moves into this new domain, then, perfor- 
mance and scalability issues should be kept in mind 
from the start. 

The content of this paper is as follows. Section 2 will 
survey currently available RDF update formats. Sec- 
tion 3 will discuss transport protocols for RDF update 
streams at a high level. Both TCP-based update streams
and the hitherto unexplored option of UDP-based update
streams will be discussed. Section 4 will argue in favor of
lossless data compression regardless of the choice of
protocol. Finally, Section 5 will describe a concrete
implementation of a distributed, UDP-based solution 
in which a volume of data equivalent to the Twitter 
Firehose is pushed from a client machine to a remote 
server 3 4 and ingested into an RDF triple store in real 



1 http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

2 http://www.w3.org/TR/2009/WD-sparql11-http-rdf-update-20091022/

3 The scripts and programming notes for this research are open
source and can be found here:
http://github.com/joshsh/laboratory/tree/master/research/rdfstream/.

4 All numerical results were derived using the following hardware
and software:

- sending machine: Ubuntu 9.10 Server on an Amazon EC2
"small" virtual machine in Bloomsbury, NJ (USA) with 2 GB
RAM, 160 GB disk, and one single-core, 2.66 GHz Intel Xeon
processor E5430
time, such that the data is immediately available for
query through a SPARQL endpoint. 

2. RDF update formats 

For some applications, it may be sufficient to think 
of updates simply as streams of RDF triples. A news 
feed, for example, may describe each new story as 
a distinct resource, neither replacing nor invalidating 
descriptions which have gone before. In this case, an 
RDF update feed might be nothing more than a succes- 
sion of RDF/XML documents, or perhaps SPARQL re- 
sults. Other applications, however, are more stateful. A 
user's "mood" which changes from "sad" to "happy" is 
ambiguous if the addition of the new mood is not pre- 
ceded by the deletion of the old one. All of the RDF up- 
date formats described below support the addition and 
deletion of statements, while many of them also sup- 
port further operations such as the creation of named 
graphs or the definition of namespaces. 
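
As a minimal sketch of why deletion matters for stateful data, the following Python fragment models a graph as a set of (subject, predicate, object) tuples. The `ex:` names and the `apply_update` helper are illustrative assumptions and do not belong to any of the formats surveyed here.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def apply_update(graph: Set[Triple],
                 removes: Iterable[Triple],
                 adds: Iterable[Triple]) -> Set[Triple]:
    """Return a new graph with the removals applied before the additions."""
    return (set(graph) - set(removes)) | set(adds)


# Hypothetical data: a user's mood changes from "sad" to "happy".
graph = {("ex:alice", "ex:mood", "sad")}

# Append-only stream: both moods remain, so the current mood is ambiguous.
append_only = graph | {("ex:alice", "ex:mood", "happy")}

# Stateful update: the old mood is deleted before the new one is added.
updated = apply_update(
    graph,
    removes=[("ex:alice", "ex:mood", "sad")],
    adds=[("ex:alice", "ex:mood", "happy")],
)
```

In the append-only case the graph ends up holding two mood statements, whereas the stateful update leaves exactly one.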

For the moment, only the vocabulary component of 
these technologies will be considered. We will also 
ignore the subtle distinction between change formats, 
which express a difference between RDF graphs or 
successive states of an RDF graph, and update formats, 
which apply an update operation to a graph database. 

2.1. SPARQL/Update

The SPARQL/Update 5 [11] language, nicknamed 
SPARUL, comes closest to a standard update lan- 
guage for RDF graphs. Using a syntax derived from 
SPARQL, SPARUL provides several basic update op- 
erations, including statement-level insertion and dele- 
tion. See Figure 1 for an example. 

Also worth mentioning are the very similar
SparqlUpdateLanguage 6 and the SPARQL update syntax
of ARC's SPARQL+ 7 .

2.2. Delta ontology 

The Delta ontology[2] and Notation3 8 -based file 
format apply the notions of textual diff and patch to 



- receiving machine: Ubuntu Server 10.04 on a rack-mounted 
server in Oakland, CA (USA) with 64 GB RAM, 2TB disk, and 
eight 4-core, 2.13 GHz Intel Xeon processors E5506 

5 http://www.w3.org/TR/sparql11-update/
6 http://esw.w3.org/SparqlUpdateLanguage
7 http://arc.semsol.org/docs/v2/sparql+
8 http://www.w3.org/DesignIssues/Notation3



RDF graphs, permitting the syndication of changes to 
graphs distributed among two or more peers. See Fig- 
ure 2 for an example. 

2.3. Changesets

Changesets 9 is a resource-oriented scheme for track- 
ing changes to an RDF graph. An update, or change- 
set, is centered on a single subject of change, such that 
the change is specific to the bnode closure, or concise 
bounded description, of that resource. 

The Changeset RDF vocabulary 10 uses RDF reifi- 
cation to express changes in terms of triples added or 
removed, and additionally includes terms to express 
meta-information about a change, including its time 
and purpose, the entity responsible, and the preceding 
change in a history of changes. See Figure 3 for an ex- 
ample. 

2.4. GUO 

The Graph Update Ontology (GUO) 11 defines an 
RDF diff in terms of triple-level insert and delete op- 
erations. Like the Changesets vocabulary, GUO ex- 
presses an update as an RDF resource, allowing ad- 
ditional metadata to be attached to the update. Unlike 
Changesets, GUO avoids RDF reification and supports 
named graphs. See Figure 4 for an example. 

2.5. GRUF 

The Guaranteed RDF Update Format (GRUF) 12 is 
a proposed plain-text format for RDF updates. While 
there are currently no software implementations of 
GRUF, it is more compact than any of the other for- 
mats described here, making it potentially appropriate 
for high-volume RDF update streams. It supports both
triples and named graph quads. See Figure 5 for an ex- 
ample. 

2.6. Sesame RDF transactions 

The Sesame 2.0 RDF framework includes a docu- 
ment format for RDF updates which has been given the 
media type application/x-rdftransaction. Statement- 
level add and remove operations are expressed with 



9 http://n2.talis.com/wiki/Changesets
10 http://vocab.org/changeset/schema.html
11 http://webr3.org/specs/guo/
12 http://websub.org/wiki/GRUF
Fig. 1. SPARQL/Update example

PREFIX dc: <http://purl.org/dc/terms/>

DELETE { <http://example.org/ns#resource1> dc:title "Original Title" }
INSERT { <http://example.org/ns#resource1> dc:title "New Title" }



Fig. 2. Delta ontology example

@prefix diff: <http://www.w3.org/2004/delta#> .
@prefix dc: <http://purl.org/dc/terms/> .

{ <http://example.org/ns#resource1> dc:title "Original Title" }
    diff:replacement
{ <http://example.org/ns#resource1> dc:title "New Title" } .



Fig. 3. Changesets example

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="http://example.org/changes#change1">
    <cs:subjectOfChange rdf:resource="http://example.org/ns#resource1"/>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.org/things#resource1"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/terms/title"/>
        <rdf:object>Original Title</rdf:object>
      </rdf:Statement>
    </cs:removal>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.org/things#resource1"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/terms/title"/>
        <rdf:object>New Title</rdf:object>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>



Fig. 4. GUO example

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix guo: <http://webr3.org/owl/guo#> .

_:u0 rdf:type guo:UpdateInstruction ;
    guo:target_subject <http://example.org/ns#resource1> ;
    guo:delete _:d0 ;
    guo:insert _:i0 .
_:d0 dc:title "Original Title" .
_:i0 dc:title "New Title" .
Fig. 5. GRUF example

set_subject http://example.org/things#resource1
set_property http://purl.org/dc/terms/title
delete text Original Title
add text New Title



subject-predicate-object triple patterns which include 
an optional named graph component. See Figure 6 for 
an example. Because this simple XML-based format is
easy to parse efficiently, it has been reused in the
AllegroGraph triple store, 13 and consequently in the
experimental evaluation described in Section 5.
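
To make the shape of such a transaction document concrete, here is a minimal Python sketch that builds and re-parses one using only the standard library. The element names (`transaction`, `remove`, `add`, `uri`, `literal`, `contexts`) are taken from the Sesame example figure; the helper names are hypothetical, and the sketch assumes literal objects and an empty context list only.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple

Statement = Tuple[str, str, str]  # (subject URI, predicate URI, literal value)


def build_transaction(removes: Iterable[Statement],
                      adds: Iterable[Statement]) -> bytes:
    """Serialize remove and add operations as one transaction document."""
    root = ET.Element("transaction")
    for tag, statements in (("remove", removes), ("add", adds)):
        for subject, predicate, literal in statements:
            op = ET.SubElement(root, tag)
            ET.SubElement(op, "uri").text = subject
            ET.SubElement(op, "uri").text = predicate
            ET.SubElement(op, "literal").text = literal
            ET.SubElement(op, "contexts")  # empty: no named graph given
    return ET.tostring(root, encoding="utf-8")


def parse_transaction(document: bytes) -> List[Tuple[str, str, str, str]]:
    """Return (operation, subject, predicate, literal) tuples in document order."""
    operations = []
    for op in ET.fromstring(document):
        subject, predicate = (uri.text for uri in op.findall("uri"))
        operations.append((op.tag, subject, predicate, op.findtext("literal")))
    return operations


SUBJECT = "http://example.org/things#resource1"
TITLE = "http://purl.org/dc/terms/title"
document = build_transaction(
    removes=[(SUBJECT, TITLE, "Original Title")],
    adds=[(SUBJECT, TITLE, "New Title")],
)
operations = parse_transaction(document)
```

Because each operation is a flat list of child elements, a streaming parser could process the same document without building a tree; the tree-based version is used here only for brevity.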

2.7. Other formats 

Several other RDF update formats have been pro- 
posed, although they are closely tied with uncommon 
query languages. The rdfDB Query Language 14 , for 
example, provides insert and delete operations. The 
Modification Exchange Language (MEL) [8] is based 
on RDF reification and interoperates with Edutella's 15 
Query Exchange Language (QEL). The RDF Update 
Language (RUL) [6] deals with type-safe class and 
property instance level updates, interoperating with the 
query and view languages RQL and RVL. 

Furthermore, there are many additional technolo- 
gies which deal with RDF updates and change noti- 
fication, without however providing a statement-level 
update format. For example, the Triplify update vocab- 
ulary 16 alerts data consumers to incremental changes 
by providing pointers to RDF documents which have 
changed, while DSNotify 17 and the Web of Data Link 
Maintenance Protocol 18 facilitate synchronization of 
Linked Data link sets. 



Fig. 6. Sesame RDF transactions example

<transaction>
  <remove>
    <uri>http://example.org/things#resource1</uri>
    <uri>http://purl.org/dc/terms/title</uri>
    <literal>Original Title</literal>
    <contexts/>
  </remove>
  <add>
    <uri>http://example.org/things#resource1</uri>
    <uri>http://purl.org/dc/terms/title</uri>
    <literal>New Title</literal>
    <contexts/>
  </add>
</transaction>

13 http://www.franz.com/agraph/allegrograph/doc/new-http-server.html#header2-235
14 http://www.guha.com/rdfdb/query.html#insert
15 http://www.edutella.org/edutella.shtml
16 http://triplify.org/vocabulary/update
17 http://dsnotify.org/
18 http://www4.wiwiss.fu-berlin.de/bizer/silk/wodlmp/
19 http://www.openrdf.org/doc/sesame2/system/ch08.html
20 http://n2.talis.com/wiki/Changeset_Protocol
21 http://code.google.com/p/pubsubhubbub/
22 http://danbri.org/words/2008/02/11/278

3. Transport protocols

Most of the RDF update formats described in the
preceding section are intended to be used in conjunction
with a particular communication protocol.
SPARQL/Update, for example, is now associated with
the SPARQL 1.1 protocol for managing RDF graphs.
The RDF transactions format is not even a proposed
standard, having only been intended for use with
Sesame's HTTP protocol. 19 Changesets has its own
HTTP-based protocol, 20 although it is also used simply
as a vocabulary for representing changes. SparqlPuSH,
which embeds SPARQL query results in RSS
and Atom feeds, uses the PubSubHubbub protocol 21 to
proactively broadcast updates to data subscribers via
HTTP POST.

Given the origins of the Semantic Web, it is not
surprising that nearly all of these protocols are based on
HTTP. The proposed XMPP bindings for the SPARQL
protocol 22 are an exception, while at a lower level,
both HTTP and XMPP are usually layered upon the
Transmission Control Protocol (TCP). It is these lower
levels of protocol which impose the most basic constraints
on both the latency and throughput of RDF update
streams sent over the Internet, so we will discuss
them in the following, illustrating their well-known
properties with small-scale experiments.

In somewhat more depth, we will also examine another
core member of the Internet Protocol suite, the
User Datagram Protocol (UDP), and explore the constraints
it imposes as a carrier of RDF updates.

3.1. Basic observations

TCP is a reliable, connection-based protocol, which
entails both advantages and disadvantages with respect
to latency and throughput. Establishment of a connection
involves the overhead of an initial handshake, after
which a two-way stream of bytes flows efficiently
between endpoints. Packets are guaranteed to arrive intact
and in order. However, this requires that any lost
packets are retransmitted, incurring additional delays.
UDP, on the other hand, is connectionless and guaran- 
tees only that individual datagrams will either arrive 
intact or not at all, and that in indeterminate order. It 
therefore avoids the overhead of an initial handshake 
and of retransmission of lost packets, at the expense of 
reliability. 
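
The connectionless, fire-and-forget character of UDP can be sketched in a few lines of Python. This is an illustration over the loopback interface, not the Java and Common Lisp code used in the experiments; the function names and the payload are assumptions made for the example.

```python
import socket
from typing import Optional


def send_datagram(payload: bytes, host: str, port: int) -> None:
    """Send one datagram: no handshake, acknowledgement or retransmission."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def receive_datagram(sock: socket.socket,
                     timeout: float = 1.0) -> Optional[bytes]:
    """Return the next datagram, or None if nothing arrives in time."""
    sock.settimeout(timeout)
    try:
        data, _sender = sock.recvfrom(65535)
        return data
    except socket.timeout:
        return None  # a lost datagram is simply never seen by the receiver


# Loopback demonstration; port 0 asks the operating system for a free port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
port = receiver.getsockname()[1]
send_datagram(b"update-1", "127.0.0.1", port)
received = receive_datagram(receiver)
receiver.close()
```

The sender never learns whether the datagram arrived, which is exactly the property that removes handshake and retransmission overhead.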

Studies of TCP throughput for bulk transfer of data 
suggest that it is governed by a handful of factors in- 
cluding round-trip time and a path-specific probability 
of packet loss [9]. As packet loss increases, through- 
put drops according to successively higher-order ex- 
ponentials, making TCP increasingly inefficient over 
congested or otherwise lossy networks. UDP through- 
put, in contrast, drops off in direct proportion to packet 
loss. 
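
The contrast between the two protocols under loss can be made tangible with a rough numerical sketch. Note the assumptions: the TCP estimate below uses the widely cited inverse-square-root approximation of Mathis et al., which is a simpler model than the one in [9], and the segment size, round-trip time and send rate are invented for illustration.

```python
import math


def tcp_throughput_estimate(mss_bytes: float, rtt_seconds: float,
                            loss: float) -> float:
    """Approximate steady-state TCP throughput in bytes per second.

    Uses the simplified model rate = (MSS / RTT) * sqrt(3/2) / sqrt(p),
    which is an assumption of this sketch, not the model of [9].
    """
    return (mss_bytes / rtt_seconds) * math.sqrt(1.5) / math.sqrt(loss)


def udp_goodput(send_rate: float, loss: float) -> float:
    """Delivered UDP rate: falls off in direct proportion to packet loss."""
    return send_rate * (1.0 - loss)


# Hypothetical path: 1460-byte segments, 80 ms round-trip time.
tcp_at_1_percent = tcp_throughput_estimate(1460, 0.08, 0.01)
tcp_at_4_percent = tcp_throughput_estimate(1460, 0.08, 0.04)

# Hypothetical sender emitting 1000 datagrams per second.
udp_at_1_percent = udp_goodput(1000, 0.01)
udp_at_4_percent = udp_goodput(1000, 0.04)
```

Under these assumptions, quadrupling the loss rate halves the TCP estimate, while the UDP delivery rate falls only from 990 to 960 datagrams per second.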

3.2. HTTP GET 

HTTP's GET method is primarily used to retrieve 
Web resources such as HTML pages, images, JavaScript 
documents, and so on, based on their URIs. To do
so, a client sends an HTTP request message to the
server, which answers with an HTTP response
transmitted to the client over a persistent connection.
Since the client does not need to re-negotiate the TCP
connection after the initial request, a larger response
body yields a higher ratio of data received
to time spent. This is illustrated in Figure 7, in which
a client has repeatedly retrieved a document of varying
size from a server. 23 If data is retrieved in 1,000-byte
chunks, around 9 requests per second are possible
in the experimental environment, or 9,000 bytes per
second. If, however, data is retrieved in 100,000-byte
chunks, only one request per second is possible, but
this amounts to over 10 times as much data per second.
In terms of throughput of RDF updates, successive
large documents, or a single continuous stream,
are preferable to a larger number of smaller documents.
Therefore, it should be possible to group multiple
updates into a single document or stream.

23 Each data point is based on 100 HTTP GET requests from a Java
program on the sending machine to an AllegroServe HTTP server
on the receiving machine.

Fig. 7. Data throughput using HTTP GET (horizontal axis: size of
HTTP GET response entity, in bytes, from 0 to 1e+06)
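
The arithmetic behind this comparison is simple enough to spell out. The sketch below uses only the request rates and chunk sizes quoted in the text; the function name is an assumption.

```python
def bytes_per_second(chunk_bytes: int, requests_per_second: float) -> float:
    """Throughput when each request carries one chunk of the given size."""
    return chunk_bytes * requests_per_second


small_chunks = bytes_per_second(1_000, 9)    # about 9 requests per second
large_chunks = bytes_per_second(100_000, 1)  # about 1 request per second
ratio = large_chunks / small_chunks
```

With these figures the large-chunk case moves 100,000 bytes per second against 9,000, a ratio of roughly 11, which is the "over 10 times" stated above.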
Fig. 8. Data throughput using HTTP POST (horizontal axis: size of
HTTP POST entity, in bytes)
Fig. 9. Data throughput using UDP

3.3. HTTP POST 

POST requests follow exactly the same pattern as 
GET responses (see Figure 8): 24 the larger the body 
of successive requests, the higher the throughput. In 
analogy to GET, this is an argument in favor of update 
protocols which allow for multiple update operations 
per POST. 

3.4. UDP 

UDP throughput follows an altogether different pat- 
tern than that of TCP-based HTTP GET or POST (see 
Figure 9). 25 Since a UDP payload is contained in a sin- 
gle packet or datagram, its size is limited by the max- 
imum transmission unit (MTU) of the path. Most of 
the Internet is subject to the Ethernet v2 frame format, 
which imposes an MTU of 1500 bytes. As a UDP datagram
includes an 8-byte header, the payload should be
no larger than 1492 bytes (indicated in the figure with
a dashed line). All IPv4 hosts must be prepared to
accept datagrams of up to 576 bytes, so a UDP payload
of less than 568 bytes (also indicated with a dashed
line) is sub-optimal. Datagrams larger than the MTU
are subject to fragmentation, with negative implications
for throughput. The second rising slope in the figure
is evidence of datagrams which have been broken
into two fragments each.

24 Each data point is based on 100 POST requests from a Java
program on the sending machine to an AllegroServe HTTP server on
the receiving machine. The apparent difference in absolute throughput
for GET and POST is due to the opposite direction of flow of
the HTTP payload between the two machines, which are subject to
a difference in download and upload bandwidth.

25 Each data point is based on 1000 UDP datagrams from a Java
program on the sending machine to an Allegro Common Lisp
program on the receiving machine.

Fig. 9, horizontal axis: size of UDP payload (bytes), from 1000 to 4000.
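
The payload limits discussed above follow from simple header arithmetic, sketched here in Python. The first two results reproduce the figures in the text, which subtract only the 8-byte UDP header. The 20-byte IPv4 header and the fragment count are assumptions added in this sketch (a minimal IPv4 header without options); counting that header gives somewhat smaller budgets.

```python
import math

ETHERNET_MTU = 1500       # bytes, as stated in the text
MIN_IPV4_DATAGRAM = 576   # bytes every IPv4 host must accept
UDP_HEADER = 8            # bytes
IPV4_HEADER = 20          # bytes; assumption: minimal header, no options


def max_udp_payload(datagram_limit: int, ip_header: int = 0) -> int:
    """Largest UDP payload that fits within the given datagram size limit."""
    return datagram_limit - ip_header - UDP_HEADER


def fragment_count(payload: int, mtu: int = ETHERNET_MTU,
                   ip_header: int = IPV4_HEADER) -> int:
    """Number of IPv4 fragments needed for one UDP datagram.

    Fragment data is carried in multiples of 8 bytes, so the usable space
    per fragment is rounded down accordingly.
    """
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((UDP_HEADER + payload) / per_fragment)


# Accounting used in the text (UDP header only).
text_upper = max_udp_payload(ETHERNET_MTU)        # 1492
text_lower = max_udp_payload(MIN_IPV4_DATAGRAM)   # 568

# Accounting that also subtracts the assumed IPv4 header.
strict_upper = max_udp_payload(ETHERNET_MTU, IPV4_HEADER)       # 1472
strict_lower = max_udp_payload(MIN_IPV4_DATAGRAM, IPV4_HEADER)  # 548
```

Under the stricter accounting, `fragment_count(1472)` is 1 and `fragment_count(1473)` is 2, so a sender that wants to stay clear of fragmentation may prefer to target the smaller budget.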

This constraint, together with the unreliability of 
UDP, imposes several requirements on a streaming 
RDF application: 

1. data loss, to some extent, is acceptable

2. it is possible to break updates into small, atomic
transactions, such that the loss of individual
transactions will not corrupt the RDF database
on the receiving end

3. communication is one-way, such that the sender does
not require acknowledgement of receipt of
transactions

4. order of delivery of transactions is not important
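
The second requirement can be sketched as a greedy packing step: atomic units of change are grouped into transactions whose compressed size stays within a datagram budget, so that losing any one datagram loses only whole units. The line-based encoding, the unit strings and the function names below are hypothetical stand-ins chosen to keep the example self-contained; they are not one of the update formats surveyed earlier.

```python
import zlib
from typing import Iterable, List


def encode(units: List[str]) -> bytes:
    """Hypothetical wire format: newline-joined units, zlib-compressed.

    Units must not contain newline characters.
    """
    return zlib.compress("\n".join(units).encode("utf-8"))


def decode(payload: bytes) -> List[str]:
    """Inverse of encode()."""
    return zlib.decompress(payload).decode("utf-8").split("\n")


def pack_units(units: Iterable[str], budget: int = 1492) -> List[bytes]:
    """Greedily group atomic units into payloads of at most `budget` bytes.

    A unit is never split across payloads. A single unit that exceeds the
    budget on its own is still emitted and would be fragmented in transit.
    """
    payloads: List[bytes] = []
    current: List[str] = []
    for unit in units:
        candidate = current + [unit]
        if current and len(encode(candidate)) > budget:
            payloads.append(encode(current))
            current = [unit]
        else:
            current = candidate
    if current:
        payloads.append(encode(current))
    return payloads


# Each unit pairs a removal with its replacement so they travel together.
units = [f'remove <s{i}> <p> "old {i}" ; add <s{i}> <p> "new {i}"'
         for i in range(200)]
payloads = pack_units(units, budget=256)  # small budget to force splitting
```

Keeping a removal and its matching addition in the same unit means that a lost datagram leaves the old value in place rather than deleting it without a replacement.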

UDP-based update streams are therefore not a 
general-purpose solution, but they may confer advan- 
tages, in terms of latency and throughput, for certain 
very demanding applications. For example, the frame- 
work described in Section 5 addresses a use case in 
which there is more data than available bandwidth and 
the main concern is to transmit as high a proportion of 
the data as possible. As another example, some varieties
of sensor data are so time-sensitive that it is better
to drop lost updates than to attempt to retransmit
them, particularly when sensors operate under less- 
than-perfect network conditions. Similarly, the fre- 
quent use of UDP for online multiplayer games hints 
at use cases for UDP-based RDF streams in real-time 
interactive environments, with potential applications 
in pervasive computing and augmented reality. 

In addition, UDP allows the possibility of IP mul- 
ticasting, in which RDF updates are broadcast from a 
single data producer to a practically unlimited number 
of data consumers. 



4. Compression 

Lossless compression of updates is found to be ben- 
eficial in all cases. However, the choice of a com- 
pression format, as well as efficient implementations 
of the compression and decompression algorithm, are 
relevant. Figure 10 illustrates the effect of three loss- 
less compression strategies on the size of a small RDF 
update document in the Sesame RDF transaction for- 
mat. 26 Beginning with an average message of over 
5,000 bytes, the document is reduced to the target size 
of a UDP datagram by two of the strategies, both of 
which support fast compression and decompression. 

Implementations of DEFLATE in Java and Com- 
mon Lisp were found to both compress and decom- 
press RDF transaction documents of 100,000 bytes or 
less in under a millisecond each, making the compres- 
sion overhead significantly less than the corresponding 
gain in throughput. 
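
The effect of compression on a repetitive transaction document can be sketched with Python's standard `zlib` module, which wraps DEFLATE in a small zlib container. The synthetic document below is an assumption made for illustration and is not the sample of tweets used in the experiments; the 1492-byte limit is the payload target given in Section 3.4.

```python
import zlib

PAYLOAD_LIMIT = 1492  # target UDP payload size from Section 3.4


def sample_transaction(n_ops: int = 40) -> bytes:
    """Build a synthetic transaction document of a little over 5,000 bytes."""
    ops = "".join(
        "<add>"
        f"<uri>http://example.org/things#resource{i}</uri>"
        "<uri>http://purl.org/dc/terms/title</uri>"
        f"<literal>Title {i}</literal>"
        "<contexts/>"
        "</add>"
        for i in range(n_ops)
    )
    return f"<transaction>{ops}</transaction>".encode("utf-8")


document = sample_transaction()
compressed = zlib.compress(document, 9)   # level 9: favor size over speed
restored = zlib.decompress(compressed)
fits_in_datagram = len(compressed) <= PAYLOAD_LIMIT
```

Real transaction documents are less uniform than this synthetic one, so their compression ratio will differ; the point of the sketch is only that the repeated markup and URI prefixes are what make such documents compress well.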



5. Implementation and evaluation 

The framework described in this section 27 is moti- 
vated by large-scale social networking services which 
generate more data than it is possible to transport be- 
tween a single pair of widely separated Internet hosts. 
Surprisingly, by giving up on transporting all of the 
data, we are in fact able to transport more data than 
would otherwise be possible. 



Fig. 10. Comparison of compression formats (horizontal axis:
compression algorithm; categories: none, LZMA, mini-LZO, DEFLATE)



5.1. RDFizing the Twitter Firehose

Twitter's Streaming API 28 provides near-real-time 
access to various subsets of public and protected Twit- 
ter data. The most privileged level of access, the Fire- 
hose stream, contains some 500 to 1000 status updates, 
or "tweets" per second, 29 where a tweet consists of 
a short snippet of text accompanied by several dozen 
fields of metadata including time and place of the up- 
date and a description of the author. 

When translated into RDF using TwitLogic, 30 a 
tweet is represented with an average of 18 triples. 31 In 
order to update the triple store appropriately (for ex- 
ample, replacing the tweet author's location with a new 
location), an average of 9 "remove" operations are also 
required, for a total of just under 30 update operations 
per tweet. In this experiment, we simulate the Twitter 
Firehose by generating a high volume of randomized 
tweets, translating them from Twitter JSON to individ- 
ual RDF transactions in the Sesame RDF transaction 
format. 



26 Based on a sample set of 100,000 tweets

27 Source code for this study is available at:
http://fortytwo.net/research/rdfstream

28 http://dev.twitter.com/pages/streaming_api

29 According to community estimates

30 http://twitlogic.fortytwo.net

31 Based on a sample set of 100,000 tweets
5.2. Transporting RDF transactions



Fig. 11. Multithreaded data ingest 



In order to validate the UDP-based technique de- 
scribed above, all RDF transactions were transported 
from the sending machine to the receiving machine in 
individual UDP datagrams, having first been DEFLATE- 
compressed. Datagrams were dispatched at a rate of 
over 1000 per second, without regard to whether they 
were received. 

The receiving machine processed successfully transmitted
datagrams as quickly as possible, expanding
and parsing the RDF transaction payload of each message
and immediately applying it to an AllegroGraph
triple store, pending commit.
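
The receive-side steps just described (expand, parse, apply pending commit) can be sketched as a single handler. This is an illustrative Python stand-in for the Lisp receiver: the `pending` list takes the place of the triple store's uncommitted state, the handler name is hypothetical, and only literal-valued statements are handled.

```python
import zlib
import xml.etree.ElementTree as ET
from typing import List, Tuple

Operation = Tuple[str, str, str, str]  # (add/remove, subject, predicate, literal)


def handle_datagram(payload: bytes, pending: List[Operation]) -> bool:
    """Decompress and parse one datagram, queueing its operations.

    Returns False and queues nothing if the payload cannot be decoded,
    so that a damaged message is dropped rather than partially applied.
    """
    try:
        root = ET.fromstring(zlib.decompress(payload))
    except (zlib.error, ET.ParseError):
        return False
    operations = []
    for op in root:
        uris = [uri.text for uri in op.findall("uri")]
        if len(uris) != 2:
            return False  # unexpected shape: reject the whole transaction
        operations.append((op.tag, uris[0], uris[1], op.findtext("literal")))
    pending.extend(operations)
    return True


message = zlib.compress(
    b"<transaction>"
    b"<remove><uri>http://example.org/things#resource1</uri>"
    b"<uri>http://purl.org/dc/terms/title</uri>"
    b"<literal>Original Title</literal><contexts/></remove>"
    b"<add><uri>http://example.org/things#resource1</uri>"
    b"<uri>http://purl.org/dc/terms/title</uri>"
    b"<literal>New Title</literal><contexts/></add>"
    b"</transaction>"
)
pending: List[Operation] = []
accepted = handle_datagram(message, pending)
rejected = handle_datagram(b"not a compressed transaction", pending)
```

Collecting operations in a local list before extending `pending` keeps each transaction atomic on the receiving side.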

5.3. Persistence 

A transactional database requires a "commit" oper- 
ation to permanently apply previously executed update 
operations. In the case of AllegroGraph, a commit is 
relatively expensive, costing at least 4 milliseconds re- 
gardless of the number of update operations to commit. 
Committing multiple tweets at a time was found to be 
more efficient, with a ratio of 100 RDF transactions, or 
3000 update operations to 1 commit practically mini- 
mizing commit overhead. 
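
The amortization argument can be sketched as a small buffer that commits once per batch. The class, its callback interface and the transaction labels are hypothetical and do not reflect the AllegroGraph client API; the 4-millisecond commit cost and the batch size of 100 are the figures quoted above.

```python
from typing import Callable, List


class CommitBuffer:
    """Collect transactions and commit them in fixed-size batches."""

    def __init__(self, commit: Callable[[List[str]], None],
                 batch_size: int = 100) -> None:
        self._commit = commit
        self._batch_size = batch_size
        self._pending: List[str] = []

    def add(self, transaction: str) -> None:
        self._pending.append(transaction)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Commit whatever is pending, if anything."""
        if self._pending:
            self._commit(self._pending)
            self._pending = []


def commit_overhead_ms(commit_cost_ms: float,
                       transactions_per_commit: int) -> float:
    """Commit cost attributed to each transaction in a batch."""
    return commit_cost_ms / transactions_per_commit


commit_sizes: List[int] = []
buffer = CommitBuffer(commit=lambda batch: commit_sizes.append(len(batch)))
for i in range(250):
    buffer.add(f"tx-{i}")
buffer.flush()  # commit the final partial batch
```

At 4 ms per commit, committing every transaction costs 4 ms each, whereas one commit per 100 transactions brings the share down to 0.04 ms each.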

At that point, a single Lisp client on the receiv- 
ing machine was found to process between 128 and 
269 transactions per second, 32 depending on the status 
of the background merge process: immediately after 
merging indices, the triple store accepts over twice as 
many transactions per second as it does immediately 
before completion of the merge. Write performance 
then degrades at a roughly constant rate as new trans- 
actions are committed, repeating in a sawtooth pat- 
tern for successive merge cycles. Without the use of 
a transaction buffer, it is the minimum write perfor- 
mance which defines the triple store's ability to keep 
up with an incoming stream of data. 

32 All results were computed with an initially empty AllegroGraph
triple store which grew to a size of 65 million triples by the end of the
experiment. The results were not found to be affected significantly
by the addition of 240 million triples of DBpedia data.

5.4. Multiprocessing

In AllegroGraph, data ingest can be facilitated by
making use of multiple triple store clients, each in
its own Lisp-based process. As shown in Figure 11,
data throughput then increases nearly linearly as new
clients are created, sharing the load of data ingest. This
experiment was performed with 1, 2, 3, 4, 6 and 8
client processes, where each client receives UDP messages
on a separate port and the sending machine distributes
outgoing messages evenly across those ports.
The upper line in the figure represents the combined
maximum throughput (immediately after a merge) of
all receiving processes. At four processes, maximum
throughput begins to level off at around 980 transactions
per second, or 98% of the throughput of the sending
machine. In other words, messages are consumed
as quickly as they are produced, disregarding packet
loss. At eight processes, minimum throughput is around
90% of the ceiling value, or 88% of the total stream.

Fig. 11, horizontal axis: number of receiving processes.

Overall, the system successfully commits around
930 tweets per second to the remote triple store, which
is close to the estimated volume of the Twitter Firehose.
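
The even distribution of outgoing messages across receiver ports can be sketched as a round-robin assignment. The port numbers and the function name are hypothetical; the throughput figures in the final lines are those reported for the experiment.

```python
from collections import Counter
from itertools import cycle
from typing import Iterable, List, Sequence, Tuple


def assign_ports(messages: Iterable[bytes],
                 ports: Sequence[int]) -> List[Tuple[bytes, int]]:
    """Pair each outgoing message with a receiver port in round-robin order."""
    return list(zip(messages, cycle(ports)))


# Hypothetical setup: four receiving processes, one UDP port each.
ports = [9001, 9002, 9003, 9004]
messages = [f"tx-{i}".encode("utf-8") for i in range(1000)]
assignments = assign_ports(messages, ports)
per_port = Counter(port for _message, port in assignments)

# Reported figures: about 1000 transactions/s sent, a ceiling of about 980/s.
ceiling_fraction = 980 / 1000      # 98% of the sender's throughput
minimum_rate = 0.90 * 980          # about 882 transactions/s at eight processes
minimum_fraction = minimum_rate / 1000
```

Round-robin assignment needs no feedback from the receivers, which suits a one-way protocol; it does assume that the receiving processes are roughly equally fast.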



5.5. Possibilities for scalable query answering 

In the above, we have demonstrated a low-latency, 
high-throughput solution for streaming RDF updates 
and data ingest. In order to make good use of this data, 
however, real-time query capabilities are also required. 
This presents a scalability challenge if data ingest 
places high computational demands on the ingesting 
machine. In AllegroGraph, a file-based transaction
log offers the possibility of replicating a primary triple 
store on any number of secondary machines, which 
then share the burden of query answering among them- 
selves. This functionality has been implemented but 
has yet to be tested in a high-throughput setting such as 
the above. It requires the overhead of reading from the 
transaction log to be less than the overhead of receiv- 
ing RDF transactions over UDP. Otherwise, IP multi- 
casting provides a better solution. 

6. Conclusion 

In the above, we have surveyed two of the most im- 
portant technical choices surrounding general-purpose, 
real-time RDF data streams - namely, data formats and 
transport protocols - with an eye towards maximiz- 
ing data throughput. In the case of formats, there is no 
shortage of options, which may be distinguished from 
one another in terms of their relative compactness, ease 
of generation and parsing, and the presence of ma- 
ture implementations. Although most of these formats 
are associated with individual HTTP-based protocols, 
throughput is limited primarily by a handful of factors 
common to all of them, including message size and 
the use of data compression. In particular, protocols 
which make use of HTTP POST can dramatically in- 
crease their performance ceiling by sending an arbi- 
trary number of atoms of data per connection. Given 
current tools, lossless compression always confers a 
performance advantage. 

In addition to HTTP POST and GET, we have also 
considered an alternative, UDP-based technique for 
RDF data streaming. We have argued that it offers 
a slight performance advantage as a replacement for 
high-volume HTTP-based streams, but that it may be 
most appropriate in future real-time Semantic Web ap- 
plications for which minimal latency is the overriding 
concern. 

Finally, we have illustrated our observations with a 
real system which implements the UDP-based tech- 
nique, evaluating its performance with respect to an 
oft-cited example of a high-volume data stream, the 
Twitter Firehose. This system, which combines an 
RDF triplification tool with an RDF update stream and 
an RDF graph database, is presented as evidence that 
current Semantic Web technologies are up to the task 



3 This is not necessarily the case in the above, as we did not make 
full use of the multiprocessing capability of the receiving machine. 



of participating in highly demanding real-time appli- 
cations. 



7. Future work 

Although the update formats and transport protocols 
surveyed above serve as a starting point for the devel- 
opment of high-performance RDF streaming applica- 
tions, there are many more possibilities to be explored. 
For example, the Datagram Congestion Control Pro- 
tocol (DCCP) and the Stream Control Transmission 
Protocol (SCTP) are both message-oriented protocols 
which add congestion control, MTU discovery, and a 
measure of reliability over UDP. Alternatively, IP mul- 
ticasting may prove useful in the broadcasting of RDF 
updates by popular real-time data providers. Finally, it 
is worth noting that the goal achieved in the preced- 
ing section - that of transporting, ingesting and query- 
ing over the Twitter Firehose - is a reasonably well- 
defined yet rather informal benchmark with respect to 
throughput of RDF data. Much as concrete metrics 
such as the Lehigh University Benchmark (LUBM) 34 
have been developed to evaluate RDF graph databases 
in terms of integrity and scalability, so there is also 
a need for metrics which address throughput and re- 
sponse time of highly dynamic real-time Semantic 
Web services. 